As a JIT ("just-in-time") compiled language, Julia is designed for good performance. Currently, well-written Julia code is usually expected to run within a factor of 2 of the corresponding C code.
However, attaining this performance requires following certain principles when writing code; see the Performance tips section of the Julia manual for more details.
When profiling, always run each function once with the correct argument types before timing it: the first call includes the time taken to compile the function, which can dominate the measurement.
Pkg.update()
INFO: Updating METADATA...
INFO: Updating cache of Compose...
INFO: Updating cache of Gadfly...
INFO: Computing changes...
INFO: Cloning cache of Contour from git://github.com/tlycken/Contour.jl.git
INFO: Upgrading Compose: v0.3.0 => v0.3.1
INFO: Installing Contour v0.0.1
INFO: Upgrading Gadfly: v0.3.0 => v0.3.1
INFO: Building Datetime
@time sin(10)
elapsed time: 6.941e-6 seconds (96 bytes allocated)
-0.5440211108893698
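The effect of compilation on the first call can be seen by timing the same call twice. The following is a minimal sketch, assuming a Julia version that provides the @elapsed macro; the function name sumsq is hypothetical and chosen only for illustration.

```julia
# Hypothetical example function (not from the original text).
function sumsq(n::Int)
    s = 0
    for i in 1:n
        s += i^2
    end
    return s
end

# The first call triggers compilation for the argument type Int,
# so its time includes the compiler's work.
t_first = @elapsed sumsq(1000)

# The second call reuses the compiled code.
t_second = @elapsed sumsq(1000)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The absolute numbers depend on the machine and the Julia version; the point is only that the first measurement should not be used as a benchmark.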
a = 3
3
Global variables are slow in Julia: do not use global variables!
Your main program should be wrapped in a function. Any time you are tempted to use globals, just send them as arguments to functions, and return them if necessary.
If you have many variables to pass around, wrap them in a type, e.g. one called State.
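The pattern of wrapping the program in a function and passing state explicitly might look like the sketch below. It assumes a recent Julia version, where a mutable composite type is declared with mutable struct (older versions used the type keyword); the field names and the functions step! and main are hypothetical.

```julia
# Hypothetical container for the variables that would otherwise be globals.
mutable struct State
    position::Float64
    velocity::Float64
end

# Functions receive the state as an argument instead of reading globals.
function step!(s::State, dt::Float64)
    s.position += s.velocity * dt
    return s
end

# The main program lives inside a function.
function main()
    s = State(0.0, 2.0)
    for _ in 1:10
        step!(s, 0.1)
    end
    return s.position
end

println(main())
```

Because every field of State has a declared concrete type, the compiler knows the types of all values used inside step! and main.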
The second important idea for gaining performance is that of type stability.
Any calculation will be immediately slowed down by having variables which can change type during a calculation, simply due to the extra work that must be done at run-time to check the type of the variables. (This is one of the main reasons for the slowness of Python and the necessity for type declarations in Cython to gain speed.)
A simple example (due to Leah Hanson) is the following pair of almost-identical functions:
function sum1(N::Int)
total = 0
for i in 1:N
total += i/2
end
total
end
function sum2(N::Int)
total = 0.0
for i in 1:N
total += i/2
end
total
end
sum2 (generic function with 1 method)
We must first run the functions once each to compile them, before looking at any timings:
sum1(10), sum2(10)
(27.5,27.5)
[Happily, they produce the same result!]
N = 10000000
@time sum1(N)
@time sum2(N)
elapsed time: 0.656544448 seconds (320108552 bytes allocated, 36.27% gc time)
elapsed time: 0.039772714 seconds (96 bytes allocated)
2.50000025e13
The second version is consistently over 10 times faster than the first version, due simply to type stability. It also allocates almost no memory. The first version allocates an enormous amount of memory (in fact, it is allocating and deallocating all the time), and spends a large fraction of its time in garbage collection.
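Type stability can also be checked without timing anything, by asking the compiler which return type it infers for each function. The sketch below repeats the two definitions so that it runs on its own. It assumes a recent Julia version and uses Base.return_types, which is an unexported helper whose behavior may change between versions. The timings quoted above come from the original session and may differ on other versions.

```julia
function sum1(N::Int)
    total = 0          # starts as an Int...
    for i in 1:N
        total += i/2   # ...and becomes a Float64 inside the loop
    end
    total
end

function sum2(N::Int)
    total = 0.0        # a Float64 from the start
    for i in 1:N
        total += i/2
    end
    total
end

# Inferred return type for a call with an Int argument.
rt1 = Base.return_types(sum1, (Int,))[1]
rt2 = Base.return_types(sum2, (Int,))[1]

println("sum1 infers: ", rt1, "  (concrete: ", isconcretetype(rt1), ")")
println("sum2 infers: ", rt2, "  (concrete: ", isconcretetype(rt2), ")")
```

A non-concrete inferred type for sum1 means the compiler cannot settle on a single type for total, which matches the Union type visible in the code_typed output for sum1.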
Packages: Lint.jl, TypeCheck.jl
To help with type stability, there are functions zero(x) and one(x) that return the zero and one of the same type as the variable x:
x = 1
zero(x)
0
y = 0.5
zero(y)
0.0
x = BigFloat("0.1")
one(x)
1e+00 with 256 bits of precision
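A typical use of zero is to initialise an accumulator so that it already has the type of the values that will be added to it. The following is a minimal sketch; the function name mysum is hypothetical, and it relies on zero also accepting a type, here the element type returned by eltype.

```julia
# Hypothetical helper: sums a vector while keeping the accumulator
# at the element type of the input from the first iteration onwards.
function mysum(v::AbstractVector)
    total = zero(eltype(v))   # zero of the element type, e.g. 0 or 0.0
    for x in v
        total += x
    end
    return total
end

println(typeof(mysum([1, 2, 3])))      # element type Int
println(typeof(mysum([0.5, 1.5])))     # element type Float64
```

Writing total = 0 instead would reproduce the problem seen in sum1 whenever the vector holds floating-point values.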
Julia gives us access to basically every step in the compilation process:
code_lowered(sum1, (Int,))
1-element Array{Any,1}: :($(Expr(:lambda, {:N}, {{:total,:#s246,:#s245,:#s244,:i},{{:N,:Any,0},{:total,:Any,2},{:#s246,:Any,18},{:#s245,:Any,2},{:#s244,:Any,18},{:i,:Any,18}},{}}, :(begin # In[23], line 2: total = 0 # line 4: #s246 = colon(1,N) #s245 = top(start)(#s246) unless top(!)(top(done)(#s246,#s245)) goto 1 2: #s244 = top(next)(#s246,#s245) i = top(tupleref)(#s244,1) #s245 = top(tupleref)(#s244,2) # line 5: total = total + i / 2 3: unless top(!)(top(!)(top(done)(#s246,#s245))) goto 2 1: 0: # line 8: return total end))))
code_lowered(sum2, (Int,))
1-element Array{Any,1}: :($(Expr(:lambda, {:N}, {{:total,:#s246,:#s245,:#s244,:i},{{:N,:Any,0},{:total,:Any,2},{:#s246,:Any,18},{:#s245,:Any,2},{:#s244,:Any,18},{:i,:Any,18}},{}}, :(begin # In[23], line 12: total = 0.0 # line 14: #s246 = colon(1,N) #s245 = top(start)(#s246) unless top(!)(top(done)(#s246,#s245)) goto 1 2: #s244 = top(next)(#s246,#s245) i = top(tupleref)(#s244,1) #s245 = top(tupleref)(#s244,2) # line 15: total = total + i / 2 3: unless top(!)(top(!)(top(done)(#s246,#s245))) goto 2 1: 0: # line 18: return total end))))
code_typed(sum1, (Int,))
1-element Array{Any,1}: :($(Expr(:lambda, {:N}, {{:total,:#s246,:#s245,:#s244,:i,:_var0,:_var1},{{:N,Int64,0},{:total,Any,2},{:#s246,UnitRange{Int64},18},{:#s245,Int64,2},{:#s244,(Int64,Int64),18},{:i,Int64,18},{:_var0,Int64,18},{:_var1,Int64,18}},{}}, :(begin # In[23], line 2: total = 0 # line 4: #s246 = $(Expr(:new, UnitRange{Int64}, 1, :(top(getfield)(Intrinsics,:select_value)(top(sle_int)(1,N::Int64)::Bool,N::Int64,top(box)(Int64,top(sub_int)(1,1))::Int64)::Int64)))::UnitRange{Int64} #s245 = top(getfield)(#s246::UnitRange{Int64},:start)::Int64 unless top(box)(Bool,top(not_int)(#s245::Int64 === top(box)(Int64,top(add_int)(top(getfield)(#s246::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool goto 1 2: _var0 = #s245::Int64 _var1 = top(box)(Int64,top(add_int)(#s245::Int64,1))::Int64 i = _var0::Int64 #s245 = _var1::Int64 # line 5: total = total::Union(Int64,Float64) + top(box)(Float64,top(div_float)(top(box)(Float64,top(sitofp)(Float64,i::Int64))::Float64,top(box)(Float64,top(sitofp)(Float64,2))::Float64))::Float64::Float64 3: unless top(box)(Bool,top(not_int)(top(box)(Bool,top(not_int)(#s245::Int64 === top(box)(Int64,top(add_int)(top(getfield)(#s246::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool))::Bool goto 2 1: 0: # line 8: return total::Union(Int64,Float64) end::Union(Int64,Float64)))))
code_typed(sum2, (Int,))
1-element Array{Any,1}: :($(Expr(:lambda, {:N}, {{:total,:#s246,:#s245,:#s244,:i,:_var0,:_var1},{{:N,Int64,0},{:total,Float64,2},{:#s246,UnitRange{Int64},18},{:#s245,Int64,2},{:#s244,(Int64,Int64),18},{:i,Int64,18},{:_var0,Int64,18},{:_var1,Int64,18}},{}}, :(begin # In[23], line 12: total = 0.0 # line 14: #s246 = $(Expr(:new, UnitRange{Int64}, 1, :(top(getfield)(Intrinsics,:select_value)(top(sle_int)(1,N::Int64)::Bool,N::Int64,top(box)(Int64,top(sub_int)(1,1))::Int64)::Int64)))::UnitRange{Int64} #s245 = top(getfield)(#s246::UnitRange{Int64},:start)::Int64 unless top(box)(Bool,top(not_int)(#s245::Int64 === top(box)(Int64,top(add_int)(top(getfield)(#s246::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool goto 1 2: _var0 = #s245::Int64 _var1 = top(box)(Int64,top(add_int)(#s245::Int64,1))::Int64 i = _var0::Int64 #s245 = _var1::Int64 # line 15: total = top(box)(Float64,top(add_float)(total::Float64,top(box)(Float64,top(div_float)(top(box)(Float64,top(sitofp)(Float64,i::Int64))::Float64,top(box)(Float64,top(sitofp)(Float64,2))::Float64))::Float64))::Float64 3: unless top(box)(Bool,top(not_int)(top(box)(Bool,top(not_int)(#s245::Int64 === top(box)(Int64,top(add_int)(top(getfield)(#s246::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool))::Bool goto 2 1: 0: # line 18: return total::Float64 end::Float64))))
code_llvm(sum1, (Int, ))
define %jl_value_t* @"julia_sum1;19538"(i64) { top: %1 = alloca [5 x %jl_value_t*], align 8 %.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0 %2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !2590 store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8 %3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !2590 %4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !2590 %.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !2590 store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !2590 store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !2590 %5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3 store %jl_value_t* null, %jl_value_t** %5, align 8 %6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4 store %jl_value_t* null, %jl_value_t** %6, align 8 store %jl_value_t* inttoptr (i64 140474354759232 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !2591 %7 = icmp sgt i64 %0, 0, !dbg !2592 br i1 %7, label %L, label %L3, !dbg !2592 L: ; preds = %top, %L %8 = phi %jl_value_t* [ %16, %L ], [ inttoptr (i64 140474354759232 to %jl_value_t*), %top ], !dbg !2592 %"#s245.0" = phi i64 [ %9, %L ], [ 1, %top ] %9 = add i64 %"#s245.0", 1, !dbg !2592 store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !2593 %10 = sitofp i64 %"#s245.0" to double, !dbg !2593 %11 = fmul double %10, 5.000000e-01, !dbg !2593 %12 = call %jl_value_t* @alloc_2w(), !dbg !2593 %13 = getelementptr inbounds %jl_value_t* %12, i64 0, i32 0, !dbg !2593 store %jl_value_t* inttoptr (i64 140474354684320 to %jl_value_t*), %jl_value_t** %13, align 8, !dbg !2593 %14 = getelementptr inbounds %jl_value_t* %12, i64 1, i32 0, !dbg !2593 %15 = bitcast %jl_value_t** %14 to double*, !dbg !2593 store double %11, double* %15, align 8, !dbg !2593 store %jl_value_t* %12, %jl_value_t** %6, align 8, !dbg !2593 %16 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140474385387040 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg 
!2593 store %jl_value_t* %16, %jl_value_t** %2, align 8, !dbg !2593 %17 = icmp eq i64 %"#s245.0", %0, !dbg !2593 br i1 %17, label %L3, label %L, !dbg !2593 L3: ; preds = %L, %top %18 = phi %jl_value_t* [ inttoptr (i64 140474354759232 to %jl_value_t*), %top ], [ %16, %L ] %19 = load %jl_value_t** %4, align 8, !dbg !2594 %20 = getelementptr inbounds %jl_value_t* %19, i64 0, i32 0, !dbg !2594 store %jl_value_t** %20, %jl_value_t*** @jl_pgcstack, align 8, !dbg !2594 ret %jl_value_t* %18, !dbg !2594 }
code_native(sum1, (Int,))
.section __TEXT,__text,regular,pure_instructions Filename: In[23] Source line: 2 push RBP mov RBP, RSP push R15 push R14 push R13 push R12 push RBX sub RSP, 56 mov R12, RDI mov QWORD PTR [RBP - 80], 6 Source line: 2 movabs RCX, 4463631920 mov RAX, QWORD PTR [RCX] mov QWORD PTR [RBP - 72], RAX lea RAX, QWORD PTR [RBP - 80] mov QWORD PTR [RCX], RAX mov QWORD PTR [RBP - 56], 0 mov QWORD PTR [RBP - 48], 0 movabs RAX, 140474354759232 Source line: 2 mov QWORD PTR [RBP - 64], RAX test R12, R12 jle 121 mov EBX, 1 Source line: 5 movabs R13, 4451150720 movabs R15, 140474354684320 movabs RCX, 4605169152 vmovsd XMM0, QWORD PTR [RCX] vmovsd QWORD PTR [RBP - 88], XMM0 movabs R14, 4450807376 mov QWORD PTR [RBP - 56], RAX call R13 mov QWORD PTR [RAX], R15 vcvtsi2sd XMM0, XMM0, RBX vmulsd XMM0, XMM0, QWORD PTR [RBP - 88] vmovsd QWORD PTR [RAX + 8], XMM0 mov QWORD PTR [RBP - 48], RAX movabs RDI, 140474385387040 lea RSI, QWORD PTR [RBP - 56] mov EDX, 2 call R14 Source line: 4 inc RBX Source line: 5 dec R12 mov QWORD PTR [RBP - 64], RAX jne -67 Source line: 8 mov RCX, QWORD PTR [RBP - 72] Source line: 2 movabs RDX, 4463631920 Source line: 8 mov QWORD PTR [RDX], RCX add RSP, 56 pop RBX pop R12 pop R13 pop R14 pop R15 pop RBP ret
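The same stages can be inspected with macro forms that take an example call rather than a tuple of argument types. The sketch below assumes a recent Julia version, where these macros live in the InteractiveUtils standard library (loaded automatically in the REPL, but not in scripts). The function half is hypothetical and kept to one line so that the output stays readable.

```julia
using InteractiveUtils   # provides the @code_* macros outside the REPL

# Hypothetical one-line function, small enough to read every stage.
half(x) = x / 2

# Each macro inspects the method that would be called for these arguments.
lowered = @code_lowered half(3)   # lowered form, before type inference
typed   = @code_typed half(3)     # after type inference
println(lowered)
println(typed)

@code_llvm half(3)                # prints the LLVM IR
@code_native half(3)              # prints the native assembly
```

The exact text printed differs between Julia versions and processors, so it will not match the listings above character for character.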
Simple profiling of a function may be achieved using the @time macro.
A detailed profile may be obtained using @profile.
A graphical view is available via the ProfileView.jl package.
@profile sum1(10000000)
2.50000025e13
f(N) = sum1(N)
f (generic function with 1 method)
@profile f(10000000)
2.50000025e13
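Running @profile only collects samples; the results still have to be displayed. The following is a minimal sketch, assuming a recent Julia version in which the profiler is provided by the Profile standard library. It repeats the definition of sum1 so that it runs on its own.

```julia
using Profile

function sum1(N::Int)
    total = 0
    for i in 1:N
        total += i/2
    end
    total
end

f(N) = sum1(N)

f(10)                    # warm up, so compilation is not profiled
Profile.clear()          # discard any samples from earlier runs
@profile f(10_000_000)   # collect samples while the call runs
Profile.print()          # print the call tree with sample counts
```

Each line of the printed tree shows how many samples were taken in that function and source line. A very short run may collect few or no samples, in which case the call can be repeated or N increased.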